Project overview:

The goal of this project is to create a machine learning algorithm that can be implemented real time into the order process to provide customers with the best delivery date target. The delivery date would be provided to the customer at the time the order is placed.

Problem statement:

The current method of setting the estimated delivery date for customer orders is a manual process that results in ~60% of orders arriving within the delivery window with a high number of orders arriving early to the customer site. Early order arrivals can create issues for customer especially when customers are not prepared to receive the orders.

There have been several attempts to improve delivery date accuracy, but these attempts have resulted in only incremental improvements in our ability to deliver orders within the delivery window.

The goal of this project is to leverage key information about the order, customer, payment process, manufacturing process and logistics network to provide the customer with the most accurate delivery date. The approach will be to understand which variables (over 100 variables are collected at time of order receipt) have the greatest impact on delivery date and to build a model that utilizes the most impactful variables to accurately predict delivery date more than 80% of the time without a negative impact to late deliveries.

Metrics:

The key metrics that will be used to measure success of this project are the following:

  1. Order On-time %: % of orders that are delivered within the delivery date window provided to the customer. The desire is to have this number exceed 80%. This metric will be measured in the following way: (orders that are delivered within the delivery window communicated to the customer/ total orders)
  2. Order Late %: % of orders that are delivered late to the delivery date window provided to the customer. The desire is to have this number below 5%. This metric will be measured in the following way: (qty orders that are delivered late to delivery date communicated to the customer/ total orders)

The analysis below will summarize the key variables as defined by the subject matter experts. The key variables to explore are the following:

  1. dayofweek: This represents the what hour of the day the order was placed
  2. Order_Entry_Hour_of_Day: This represents what day hour of the day the order was placed
  3. Wkly_Scorecard_LOB_SSC_Grouping: High level product category
  4. Product_Desc: This represents the type of product on the order
  5. SSC_Code: This represents if the order is pulled out of existing stock or if it is a custom build order
  6. Delivery_Promise_Flag: This represents which IT system was used to determine the EDD
  7. Direct_Ship_Flag: This represents if the order will be directly shipped to the customer or if it requires to be merged at a merging facility prior to being shipped to the customer
  8. local_ship_code_desc: This represents the delivery method to the customer (e.g. Next Day, 2nd Day, 3-5 Day standard delivery)
  9. ShipTo_State_Province: This represents the US state the order is being delivered to
  10. PAY_1_CD: Represents primary method of payment for the order
  11. QTE_SR_APP_NM: Represents system used to create the quote before the order was submitted
  12. Purchase_Channel: This represents how the customer purchased the order( online, offline, etc…)
  13. second_touch_flag: This represents if the order required a second touch for added services prior to customer shipment which may cause delays
  14. CFI_FLG: Incremental services added in manufacturing prior to shipment to customer
  15. order_revenue_amt: This represents the revenue of the order
  16. Sys_Qty : This represents the system quantity on the order
  17. Build_Facility: This is the facility that the product will be built in. Will be NULL if the product is stocked off the shelf.
  18. Merge_Facility: Determine the facility location where an order requires to be merged with another product prior to delivery to the customer (if merge is required)
  19. Warehousing_Flag: Determines if a customer requested that we build and warehouse the product prior to delivering it
  20. Mfg_Lead_Time: The amount of time needed in manufacturing based on the part in the order that has the longest lead time
  21. Freight_Lead_Time: The amount of time put aside for the logistics leg of the delivery to the customer
  22. Pay_Lead_Time: The amount of time put aside for the customer payment to complete processing
  23. LCL_ORD_SRC_CD: The source system that the order has been placed in
  24. Payment_Type: This represents how the customer will be paying for the order (e.g. credit card …)
  25. Src_Channel : This is the channel number the customer belongs to
  26. SDS_NBD_Flag: Determines if the order needs to be shipped the same day the order is placed or not
  27. CS_Flag: Determines if a custom service is required for the order
  28. Large_Order_Flag: Determines if an order is composed of a large number of systems or high revenue amount

DATA Wrangling.

Below is a list of the steps that were taken to wrangle and prepare the data for analysis:

  1. Pulled in the required data from the SQL server into R
  2. Converted date columns into dates since some of them show up as factors
  3. Needed to create a new column for 1st attempt delivery date based on 2 other columns
    • POI_Delivery_Date_1st_Attempt_SD_DD
    • Delivery_Attempt_1. The first step was to take the date from POI_Delivery_Date_1st_Attempt_SD_DD when it is not null and then take the date from Delivery_Attempt_1 column when the date in POI_Delivery_Date_1st_Attempt_SD_DD is null
  4. Calculated the number of business days delta for the for the following:
    • From Order Date to Estimated Delivery Date (EDD)
    • From Order Date to 1st attempt delivery date
  5. Calculated the business day delta between (Order Date - Delivery Date) - (Order Date - Estimated Delivery Date) to figure out when something is Late how many business days is Late and the same for Early.
  6. Determined if each order is currently Early, Ontime or Late using window range rules

Consistent volume on weekdays with a big drop on weekends (weekends have ~30% of the weekday order volume)

Client BTO consistently takes more time (as expected) thank Client BTS from order to delivery; for Client BTO Wed,Thursday and Friday has lower median but 3rd quartile similar to other days. Saturday is the exception with lower median and 3rd quartile. Possibly due to having more time to build and stage before Monday.

Majority of orders take place between 8AM and 5PM

Another view by hour of Client BTO by hour of day, but see the consistent delta where BTO take more time than BTS as we would expect.

Client BTS has faster delivery time than Client BTO

3 products (Latitude, OptiPlex Desktops and Personal Notebooks) driving ~70% of orders for the population.

BTS has faster delivery time than BTO as expected

Delivery Promise orders is beginning to take on more importance as it now exceeds legacy orders.

Greater number of orders are directly shipped to the customer and not sent to a merge center prior to shipment. We have 29.7K orders with null values for this flag …

For non direct ship flag you see a higher number of business days for both BTO and BTS which is what you would expect since they are not being directly shipped to the customer

Majority of orders ~63% are through 3-5 days standard ground. 2nd Day and Next Day service round out the top 3 for a combined ~90% of all order deliveries.

I found Delivery method to be interesting as I would have expected a bigger gap between 3 to 5 days ship method and 2 day ship method The median is lower on 2 day but the 25% to 75% is very similar. Next Day is lower as you would expect. This may be an area to explore

California , Texas and Florida, and New York are the top 4 states that orders are delivered to.

Florida (East Coast) runs faster than NY, TX and CA possibly due to the location of the distributions centers being closer to the east coast. State/Zip Code is likely to be a an important variable to consider.

Pay Code Comments

Quote System Comments

4 key purchase paths drive the majority of the orders; need to investigate NA values

B2BGC and OB2GC purchase paths for BTS have a higher number of biz days

Very few orders ~1% of orders have a second touch applied to them.

CFI Flag Comments

Majority of Orders are under $2K

As expected Client BTS has lower business days across price points.

Almost 80% of order have only 1 system on them. Interesting data point at system quantity = 48.

It looks like the more systems on an order the more time it takes per the scatter plot

3 Build facilities (CCC6, CTY and PTY) have the longest time to delivery.

BTO is consistently higher in business days to delivery across merge facilities while the FG1 merge facility has the highest number of days to delivery

Those orders that are marked for warehousing have a mean of ~ 9 days longer delivery for BTO and ~ 7 days for BTS orders.

BTS has a larger number of orders with a lower manufacturing lead time while BTO seems to be bi-modal with longer manufacturing lead time than BTS orders.

Client BTO orders are skewed to the right when it comes to the number of days for logistics while BTS is more normally distributed.

Most orders have a payment lead time of 0 days meaning the payment is expected to clear on the same day of the order.

BTS consistently lower number of business days than BTO.

Some differences across payment types. #, 5, P and W payment types stand out.

Some variation by customer local channel. Local channel 41 is interesting as well as 9 for BTS. 85-88 show an increase in BTO when compared to other channels.

As expected SDS shipment has lower business days in BTS but it’s surprising to see how high it is for BTO.

When we have a custom service on the order, it results in a higher number of days to delivery.

Large orders (system quantity or revenue) take longer to deliver.

Mean number of business days from Order Date to 1st attempt Delivery Date is 5.1 Days, Median is 5 days, with 1st quartile at 3 days and 3rd quartile at 6 days. It seems that there may be a binomial distributions (Client BTS vs. Client BTO)

Client BTS has a faster cycle time from order date to 1st attempt delivery date thank Client BTO. Mean days for Client BTS is 2.5 days vs. 6.1 days for Client BTO (median of 2 days vs. 6 days; standard deviation of 3 days vs. 1.5 days)

Early deliveries arrive on average -5.3 business days before the date communicated to the customer (median = - 5 days) standard deb is 3.6 days

Late deliveries arrive on average + 2.7 business days after the date communicated to the customer (median = +2 days) standard dev is 2.5 days.

Feature Selection:

With over 50 features to choose from, the next step was to determine which features had the largest impact on the target variable (quantity of business days from order date to delivery date) and to ensure that the correlation between nominal features was well understood as well as the relation between the categorical variables and the target variable using ANOVA.

After splitting the data into a training set (80% of the observations) and a test set (20% of observations), a Random Forest model was used as a first pass to determine variable importance. From the Random Forest model we see that Mfg_Lead_Time, Build_Facility, SSC_Code, local_ship_code_desc, order_revenue_amt are the top 5 variables in terms of importance to delivery days.

Focusing on the target variable (bizdaysdeliv), we find that Mfg_Lead_Time has a strong correlation to it at 0.67 followed by Warehousing_Flag and second_touch_flag both at 0.43. However, we can also see that Warehousing_Flag and second_touch_flag are highly correlated to one another, and since we met with the business process subject matter experts we determined that Warehousing_Flag would be the one to keep in the model. Another strong correlation between independent features is Sys_Qty to order_revenue_amt and to Large_Order_Flag. Again based on the deep understanding of the process, we will only use Sys_Qty in our prediction model.

ANOVA was used to understand the relation of the categorical variables and the target variable. ANOVA showed that the product related features (Product_Desc, SSC_CD and Wkly_Scorecard_LOB_SSC_Grouping), the shipment method and CFI flag were all big impact items as well.

Modeling:

Since we are trying to predict the number of business days from order date to delivery date, we need to use a regression modeling technique to do this kind of prediction (vs. using a classification modeling technique). The first step used was to spot check several algorithms and determine how well they would perform on the training data. RMSE and R-squared were used as the key metrics in identifying the model(s) that performed best on the training data. Below is a summary of each model’s performance. GBM was found to have the best performance with the lowest RMSE of 1.67 days and an R squared value of ~0.71.

Model Tuning:

The GBM model was then tuned against the test data set using 3 parameters: We can tune over the number of trees (i.e., boosting iterations), the complexity of the tree (indexed by interaction.depth) and the learning rate (also known as shrinkage). A gmb.grid was used to test multiple combinations of these 3 parameters testing against the on-time performance and late performance when using the business days to delivery. The results showed that 250 trees with an interaction depth of 5 and shrinkage of 0.05 would be an optimal parameters to use.

** Utilizing these parameters we ran the model against the test data and found that ontime performance improved from 64.4% to 76.8% (~19% improvement over the current process performance)**

Next Steps:

The key next steps include the following:

  1. Determine if we can better set the window range used to communicate delivery times to our customers based on more probabilistic range logic using a probability density function. We will need to ensure that any new range logic is competitive and acceptable to our customer base.
  2. Integrate the tuned GBM model into our real time process workflow
  3. Monitor model performance and continue to tune it as needed

Appendix:

Clustering:

One thought was to cluster the order data to determine if there were unique and meaningful clusters that could be identified and then build predictive models for each cluster to see if we could improve ontime performance.

Since the features used were mixed of mixed types (e.g.nominal and factor, all the pairwise dissimilarities (distances) between observations in the data set were calculated using “gower” distance. A hierarchical clustering algorithm was used.

To determine the optimal number of clusters, a silhouette width was calculated. 5 clusters were found to be the optimal number.

From the dendogram below, it looks like ~5 clusters would be most meaningful as well.

In order to better understand each cluster, they were plotted and analyzed. Clusters 1, 2 consisted of BTO product that were non CFI while cluster 3 was heavy BTO product with CFI. Cluster 4 was found to be mainly composed of BTS.

In order to determine the difference between clusters 1 & 2, they were further analyzed and it was found that cluster 1 was made up mainly 3-5 day shipment method while cluster 2 was composed more of premium next day and 2nd day shipment.

The idea was to split the data into BTO and BTS populations to determine if different models would work better on BTO vs. BTS, but in the end GBM was found to have the best performance for both BTS and BTO populations.

Clusters 1 & 2 both BTO showed similar performance while Cluster 4 (BTS) showed lower on-time performance as expected. The CFI BTO cluster #3 showed a little worse performance than the non CFI BTO clusters (1 & 2).